take nearly four times longer than FP model training. The slow training inevitably hinders the practical development and iteration of industrial language models. Second, conducting QAT on memory-limited devices is often infeasible due to the increasing size of large language models.
As demonstrated in [5], the QAT method [285] even consumes 8.3 GB more memory than FP training when knowledge distillation is used. By contrast, PTQ methods perform quantization by caching only the intermediate results of each layer, which fit into memory-limited training devices. Third, the training set is sometimes inaccessible due to
industry data security or privacy issues. In contrast, PTQ constructs a small calibration set by sampling only 1K∼4K instances from the whole training set.
In summary, PTQ is an appealing, efficient alternative in training time, memory over-
head, and data consumption. Generally, instead of the whole training set, PTQ methods
leverage only a small portion of training data to minimize the layer-wise reconstruction error
incurred by quantization [101, 179, 180]. The layer-wise objective breaks down the end-to-
end training, solving the quantization optimization problem in a more sample-efficient [297]
and memory-saving way. Nonetheless, it is non-trivial to directly apply previous PTQ methods to language models such as BERT [54], as the performance drops sharply. For this reason, several efforts have been devoted to improving their performance.
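As an illustration, a layer-wise reconstruction step of this kind might look like the following PyTorch sketch, which tunes a quantized copy of a single linear layer to match the full-precision outputs on a handful of cached calibration activations. The uniform fake-quantizer and all helper names are assumptions for illustration, not the procedure of any specific cited method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Uniform symmetric fake quantization with a straight-through estimator (illustrative)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: the forward pass uses w_q, gradients flow to w.
    return w + (w_q - w).detach()

def reconstruct_layer(fp_layer: nn.Linear, calib_inputs, num_bits=4, steps=200, lr=1e-3):
    """Minimize the layer-wise reconstruction error ||Q(W)x - Wx||^2 on a small calibration set."""
    q_layer = nn.Linear(fp_layer.in_features, fp_layer.out_features)
    q_layer.load_state_dict(fp_layer.state_dict())
    opt = torch.optim.Adam(q_layer.parameters(), lr=lr)
    for _ in range(steps):
        for x in calib_inputs:               # cached intermediate activations of this layer
            with torch.no_grad():
                target = fp_layer(x)         # full-precision reference output
            out = F.linear(x, fake_quantize(q_layer.weight, num_bits), q_layer.bias)
            loss = F.mse_loss(out, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return q_layer

# Example: reconstruct a 768-wide projection with a tiny calibration batch.
layer = nn.Linear(768, 768)
calib = [torch.randn(32, 768) for _ in range(4)]
q_layer = reconstruct_layer(layer, calib, num_bits=4, steps=10)
```

Because each layer is reconstructed independently from cached inputs, only one layer and its calibration activations need to reside in memory at a time, which is what makes the approach attractive on memory-limited devices.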
5.1.3 Binary BERT Pre-Trained Models
Recent pre-trained BERT models have advanced the state of the art in various natural language tasks [227, 55]. Nevertheless, deploying BERT models on resource-constrained edge devices is challenging due to their massive number of parameters and floating-point operations (FLOPs), which limits the practical application of pre-trained BERT models. To mit-
igate this, model compression techniques are widely studied and applied for deploy-
ing BERTs in resource-constrained and real-time scenarios, including knowledge distilla-
tion [206, 217, 106], parameter pruning [172, 64], low-rank approximation [166, 126], weight
sharing [50, 126, 98], dynamic networks with adaptive depth and/or width [89, 255], and
quantization [280, 208, 65, 285].
Among all these model compression approaches, quantization, which utilizes lower bit-
width representation for model parameters, emerges as an efficient way to deploy compact
BERT models on edge devices. Theoretically, it compresses the model by replacing each
32-bit floating-point parameter with a low-bit fixed-point representation. Existing attempts
try to quantize pre-trained BERT [280, 208, 65] to as low as ternary values (2-bit) with only a minor performance drop [285]. More aggressively, binarization of the weights and activations of BERT [6, 195, 222, 156, 40] can bring up to a 32× reduction in model size and replace most floating-point multiplications with additions, which significantly alleviates the heavy parameter and FLOPs burden.
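To make the 32× figure concrete, the following is a minimal, illustrative sketch (not the method of any cited work) of binarizing a weight matrix to {−α, +α} with a per-tensor scaling factor and packing the signs into single bits; the helper names and the packing scheme are assumptions for illustration.

```python
import torch

def binarize_weights(w: torch.Tensor):
    """Binarize a weight tensor to {-alpha, +alpha} with a per-tensor scaling factor (illustrative)."""
    alpha = w.abs().mean()                 # scaling factor that roughly preserves weight magnitude
    w_bin = torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))
    return w_bin, alpha

def pack_signs(w_bin: torch.Tensor) -> torch.Tensor:
    """Pack +/-1 values into uint8, 8 weights per byte: 1 bit per weight instead of 32."""
    bits = (w_bin.flatten() > 0).to(torch.uint8)
    pad = (-bits.numel()) % 8
    bits = torch.cat([bits, bits.new_zeros(pad)])
    place_values = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], dtype=torch.uint8)
    return (bits.view(-1, 8) * place_values).sum(dim=1).to(torch.uint8)

w = torch.randn(768, 768)                  # e.g. one attention projection matrix in BERT-base
w_bin, alpha = binarize_weights(w)
packed = pack_signs(w_bin)
print(f"FP32: {w.numel() * 4} bytes, packed binary: {packed.numel()} bytes")  # roughly 32x smaller
```

Storing one sign bit per weight (plus a single scaling factor per tensor) is what yields the near-32× compression; the replacement of multiplications with additions comes from the fact that multiplying by ±α reduces to a sign flip and a shared scale.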
Network binarization was first proposed in [48] and has been extensively studied in academia [199, 99, 159]. For BERT binarization, a general workflow is to binarize the representations in the BERT architecture during forward propagation and to apply distillation to guide the optimization during backward propagation. In detail, the forward and backward propagation of the sign function in a binarized network can be formulated as:
\begin{equation}
\text{Forward:}\quad \operatorname{sign}(x) =
\begin{cases}
\phantom{-}1 & \text{if } x \geq 0, \\
-1 & \text{otherwise},
\end{cases}
\tag{5.1}
\end{equation}
\begin{equation}
\text{Backward:}\quad \frac{\partial C}{\partial x} =
\begin{cases}
\dfrac{\partial C}{\partial \operatorname{sign}(x)} & \text{if } |x| \leq 1, \\
0 & \text{otherwise},
\end{cases}
\tag{5.2}
\end{equation}
where x is the input and C is the cost function for the minibatch. The sign(·) function is applied in the forward propagation, while the straight-through estimator (STE) [9] is used to obtain the gradient in the backward propagation.